The “Avengers” dataset is a collection of data on characters from the Marvel comic book series “The Avengers”. It has 173 rows and 21 columns, with information such as each character's name, gender, and number of comic book appearances, as well as whether they are a current member, the year they joined, their membership status (for example full or honorary), and a record of their deaths and returns.
In this project, I will explore different statistical techniques for analyzing the “Avengers” dataset. Specifically, I will examine the distribution of a numerical variable, demonstrate the applicability of the Central Limit Theorem using random samples, and investigate various sampling methods that can be used on the dataset. I will also draw conclusions about the strengths and limitations of different sampling methods, and discuss the implications of these results for future analyses of the “Avengers” dataset.
# REQUIRE LIBRARY
library(readr)   # read_csv()
library(dplyr)   # %>%, filter(), group_by(), summarise(), sample_n()
library(plotly)  # plot_ly()
The data comes from the following URL: https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv
avengers <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv")
For this analysis, I will look at the distribution of the Gender variable, which is categorical, and the Appearances variable, which is numerical. I will create a bar chart to visualize the distribution of Gender, and a histogram to visualize the distribution of Appearances.
library(ggplot2)
# Bar chart of Gender distribution
ggplot(avengers, aes(x = Gender)) +
geom_bar(fill = "steelblue") +
labs(title = "Gender Distribution of Avengers", x = "Gender", y = "Count")
# Histogram of Appearances distribution
ggplot(avengers, aes(x = Appearances)) +
geom_histogram(fill = "steelblue", binwidth = 100) +
labs(title = "Distribution of Appearances", x = "Appearances", y = "Count")
The bar chart shows that there are more male Avengers than female, with a ratio of about 2:1 (115 male to 58 female). The histogram shows that the distribution of Appearances is skewed to the right, with a long tail indicating that there are a few Avengers who have appeared in a very large number of issues.
# Analysis of Two Variables
For this analysis, I will look at the relationship between the Year and Appearances variables. I will create a scatter plot to visualize this relationship.
# Scatter plot of Year and Appearances
ggplot(avengers, aes(x = Year, y = Appearances)) +
geom_point(color = "steelblue") +
labs(title = "Relationship Between Year and Appearances", x = "Year", y = "Appearances")
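The scatter plot shows a negative relationship between Year and Appearances: characters who joined in the 1960s, most of them with over 1,000 appearances, have appeared in far more issues than characters who joined after 2000. This is what I would expect, because earlier members have had more time to accumulate appearances. A few rows record Year as 1900, which appears to be a placeholder rather than a real joining year, so those points should be excluded before reading a trend from the plot. The relationship may also be confounded by other factors, such as changes in the comic book industry or the popularity of the Avengers franchise.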
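One way to check the direction of this relationship numerically is a rank correlation. The sketch below is only illustrative: it uses the Year and Appearances values from twenty rows printed in the sample output of this report (the first ten rows of the dataset and ten rows from the stratified sample), not the full dataset, and it assumes that Year 1900 is a placeholder. The names `rows`, `valid`, and `rho` are mine.

```r
# Year/Appearances pairs copied from rows printed in this report's sample output
# (an illustrative subset, not the full dataset).
rows <- data.frame(
  Year = c(1963, 1963, 1963, 1963, 1963, 1963, 1964, 1965, 1965, 1965,
           1987, 2014, 1983, 2010, 1993, 2013, 2010, 1990, 2013, 1900),
  Appearances = c(1269, 1165, 3068, 2089, 2402, 612, 3458, 1456, 769, 1214,
                  83, 49, 348, 205, 28, 12, 333, 237, 22, 2)
)

# Assumption: Year == 1900 is a placeholder, not a real joining year.
valid <- subset(rows, Year > 1900)

# Spearman's rank correlation is less sensitive to extreme counts than Pearson's.
rho <- cor(valid$Year, valid$Appearances, method = "spearman")
print(rho)
```

On the full data the same call would be `cor(avengers$Year, avengers$Appearances, method = "spearman")` after filtering out the placeholder years.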
In this part, I focus on the “Appearances” variable, which represents the number of comic book issues in which the character appeared. I visualize its distribution with a histogram.
ggplot(avengers, aes(x=Appearances)) +
geom_histogram(binwidth=100) +
xlab("Number of Appearances") +
ylab("Count")
I can see that the distribution is heavily skewed to the right, with a long tail of characters who appeared in many comic book issues.
# Draw various random samples of the data and show the applicability of the Central Limit Theorem for this variable.
To test the Central Limit Theorem on the “Appearances” variable, I draw 1,000 random samples of size 30 with replacement, compute the mean of each sample, and plot the distribution of those means with a histogram:
n_samples <- 1000 # number of samples to draw
sample_size <- 30 # sample size
sample_means <- numeric(n_samples) # empty vector to store sample means
for (i in seq_len(n_samples)) {
  # `draw` avoids shadowing the base function sample()
  draw <- sample(avengers$Appearances, size = sample_size, replace = TRUE)
  sample_means[i] <- mean(draw)
}
ggplot(data.frame(sample_means), aes(x=sample_means)) +
geom_histogram(binwidth=10) +
xlab("Sample Mean") +
ylab("Count")
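The simulation can also be checked against theory: for samples of size n drawn with replacement, the standard deviation of the sample means should be close to the population standard deviation divided by the square root of n. The sketch below is self-contained, so it uses a simulated right-skewed vector as a hypothetical stand-in for `avengers$Appearances`; the lognormal parameters and the larger number of replicates are my assumptions, not values from the dataset.

```r
set.seed(1)

# Hypothetical stand-in for avengers$Appearances: 173 right-skewed counts.
population <- round(rlnorm(173, meanlog = 5, sdlog = 1.2))

n_samples   <- 5000
sample_size <- 30

# Draw samples with replacement and record each sample mean.
sample_means <- replicate(
  n_samples,
  mean(sample(population, size = sample_size, replace = TRUE))
)

# Standard deviation of the population as a fixed set of values (divisor N).
pop_sd <- sqrt(mean((population - mean(population))^2))

# Sampling theory predicts sd(sample_means) ~ pop_sd / sqrt(sample_size);
# the Central Limit Theorem adds that the shape approaches a normal curve.
theoretical_se <- pop_sd / sqrt(sample_size)
empirical_se   <- sd(sample_means)

print(c(theoretical = theoretical_se, empirical = empirical_se))
```

With the real data, replacing `population` with `avengers$Appearances` gives the same comparison for the histogram above.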
I find that the distribution of the sample means is approximately normal, even though the original distribution of the “Appearances” variable was heavily skewed. This demonstrates the applicability of the Central Limit Theorem for this variable.
# Show how various sampling methods can be used on your data. What are your conclusions if these samples are used instead of the whole dataset.
# Simple random sample
set.seed(100)
srs <- avengers %>% sample_n(50)
# Stratified random sample
stratified <- avengers %>% group_by(Gender) %>%
sample_n(size = 10) %>% ungroup()
# Cluster sample
cluster <- avengers %>% slice(1:10)
# Systematic sample
systematic <- avengers[c(1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 111, 121),]
# Convenience sample
convenience <- avengers %>% filter(Appearances >= 100)
srs
## # A tibble: 50 × 21
## URL Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵ Year Years…⁶ Honor…⁷
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 http://… "Alias… 121 NO MALE <NA> 6-Feb 2006 9 Full
## 2 http://… "Robbi… 299 NO MALE <NA> 10-Jun 2010 5 Full
## 3 http://… "Marcu… 65 YES MALE <NA> 13-Feb 2013 2 Full
## 4 http://… "Rober… 2089 YES MALE <NA> Sep-63 1963 52 Full
## 5 http://… "Rita … 68 NO FEMALE <NA> Nov-88 1988 27 Honora…
## 6 http://… <NA> 16 NO FEMALE <NA> 5-Jul 2005 10 Full
## 7 http://… "Willi… 123 YES MALE <NA> 5-Apr 2005 10 Full
## 8 http://… "Anya … 108 YES FEMALE <NA> <NA> 1900 115 Academy
## 9 http://… "Steve… 3458 YES MALE <NA> Mar-64 1964 51 Full
## 10 http://… "Marc … 402 NO MALE Sep-87 Jun-88 1988 27 Full
## # … with 40 more rows, 11 more variables: Death1 <chr>, Return1 <chr>,
## # Death2 <chr>, Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>,
## # Return4 <chr>, Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated
## # variable names ¹`Name/Alias`, ²Appearances, ³`Current?`,
## # ⁴`Probationary Introl`, ⁵`Full/Reserve Avengers Intro`,
## # ⁶`Years since joining`, ⁷Honorary
stratified
## # A tibble: 20 × 21
## URL Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵ Year Years…⁶ Honor…⁷
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 http://… Fiona 2 YES FEMALE <NA> <NA> 1900 115 Academy
## 2 http://… Bonita… 83 NO FEMALE <NA> Sep-87 1987 28 Full
## 3 http://… Ava Ay… 49 YES FEMALE <NA> 14-Jan 2014 1 Full
## 4 http://… Monica… 348 YES FEMALE Jan-83 May-83 1983 32 Full
## 5 http://… Jessic… 205 YES FEMALE <NA> 10-Aug 2010 5 Full
## 6 http://… <NA> 28 NO FEMALE <NA> Jun-93 1993 22 Honora…
## 7 http://… Monica… 12 YES FEMALE <NA> 13-Sep 2013 2 Full
## 8 http://… Sharon… 333 NO FEMALE <NA> 10-May 2010 5 Full
## 9 http://… Circe 237 NO FEMALE <NA> Feb-90 1990 25 Full
## 10 http://… Americ… 22 YES FEMALE <NA> 13-Jul 2013 2 Full
## 11 http://… Jacque… 115 NO MALE <NA> Sep-65 1965 50 Full
## 12 http://… Nichol… 77 YES MALE <NA> 13-Apr 2013 2 Full
## 13 http://… Philli… 31 NO MALE <NA> Dec-92 1992 23 Honora…
## 14 http://… Eric O… 88 NO MALE <NA> 10-May 2010 5 Full
## 15 http://… Nathan… 23 NO MALE <NA> 5-Apr 2005 10 Full
## 16 http://… Scott … 217 NO MALE Jan-87 3-Feb 2003 12 Full
## 17 http://… Delroy… 101 NO MALE <NA> Apr-00 2000 15 Full
## 18 http://… James … 533 NO MALE May-84 Sep-84 1984 31 Full
## 19 http://… Loki L… 77 NO MALE <NA> 13-Jul 2013 2 Full
## 20 http://… Wade W… 575 NO MALE <NA> 7-Sep 2007 8 Full
## # … with 11 more variables: Death1 <chr>, Return1 <chr>, Death2 <chr>,
## # Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>, Return4 <chr>,
## # Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated variable names
## # ¹`Name/Alias`, ²Appearances, ³`Current?`, ⁴`Probationary Introl`,
## # ⁵`Full/Reserve Avengers Intro`, ⁶`Years since joining`, ⁷Honorary
cluster
## # A tibble: 10 × 21
## URL Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵ Year Years…⁶ Honor…⁷
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 http://… "Henry… 1269 YES MALE <NA> Sep-63 1963 52 Full
## 2 http://… "Janet… 1165 YES FEMALE <NA> Sep-63 1963 52 Full
## 3 http://… "Antho… 3068 YES MALE <NA> Sep-63 1963 52 Full
## 4 http://… "Rober… 2089 YES MALE <NA> Sep-63 1963 52 Full
## 5 http://… "Thor … 2402 YES MALE <NA> Sep-63 1963 52 Full
## 6 http://… "Richa… 612 YES MALE <NA> Sep-63 1963 52 Honora…
## 7 http://… "Steve… 3458 YES MALE <NA> Mar-64 1964 51 Full
## 8 http://… "Clint… 1456 YES MALE <NA> May-65 1965 50 Full
## 9 http://… "Pietr… 769 YES MALE <NA> May-65 1965 50 Full
## 10 http://… "Wanda… 1214 YES FEMALE <NA> May-65 1965 50 Full
## # … with 11 more variables: Death1 <chr>, Return1 <chr>, Death2 <chr>,
## # Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>, Return4 <chr>,
## # Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated variable names
## # ¹`Name/Alias`, ²Appearances, ³`Current?`, ⁴`Probationary Introl`,
## # ⁵`Full/Reserve Avengers Intro`, ⁶`Years since joining`, ⁷Honorary
systematic
## # A tibble: 13 × 21
## URL Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵ Year Years…⁶ Honor…⁷
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 http://… "Henry… 1269 YES MALE <NA> Sep-63 1963 52 Full
## 2 http://… "Jacqu… 115 NO MALE <NA> Sep-65 1965 50 Full
## 3 http://… "Matth… 197 NO MALE <NA> Aug-75 1975 40 Full
## 4 http://… "Carol… 935 YES FEMALE <NA> Apr-79 1979 36 Full
## 5 http://… "Benja… 2305 NO MALE <NA> Jun-86 1986 29 Full
## 6 http://… "Scott… 217 NO MALE Jan-87 3-Feb 2003 12 Full
## 7 http://… "Ashle… 36 YES FEMALE <NA> Jul-89 1989 26 Full
## 8 http://… "Wade … 575 NO MALE <NA> 7-Sep 2007 8 Full
## 9 http://… <NA> 28 NO FEMALE <NA> Jun-93 1993 22 Honora…
## 10 http://… "Carl … 886 YES MALE <NA> 5-Mar 2005 10 Full
## 11 http://… "Kathe… 132 YES FEMALE <NA> 5-Jun 2005 10 Full
## 12 http://… "Maria… 359 YES FEMALE <NA> 10-May 2010 5 Full
## 13 http://… "John … 31 YES MALE <NA> 10-Dec 2010 5 Full
## # … with 11 more variables: Death1 <chr>, Return1 <chr>, Death2 <chr>,
## # Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>, Return4 <chr>,
## # Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated variable names
## # ¹`Name/Alias`, ²Appearances, ³`Current?`, ⁴`Probationary Introl`,
## # ⁵`Full/Reserve Avengers Intro`, ⁶`Years since joining`, ⁷Honorary
convenience
## # A tibble: 105 × 21
## URL Name/…¹ Appea…² Curre…³ Gender Proba…⁴ Full/…⁵ Year Years…⁶ Honor…⁷
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 http://… "Henry… 1269 YES MALE <NA> Sep-63 1963 52 Full
## 2 http://… "Janet… 1165 YES FEMALE <NA> Sep-63 1963 52 Full
## 3 http://… "Antho… 3068 YES MALE <NA> Sep-63 1963 52 Full
## 4 http://… "Rober… 2089 YES MALE <NA> Sep-63 1963 52 Full
## 5 http://… "Thor … 2402 YES MALE <NA> Sep-63 1963 52 Full
## 6 http://… "Richa… 612 YES MALE <NA> Sep-63 1963 52 Honora…
## 7 http://… "Steve… 3458 YES MALE <NA> Mar-64 1964 51 Full
## 8 http://… "Clint… 1456 YES MALE <NA> May-65 1965 50 Full
## 9 http://… "Pietr… 769 YES MALE <NA> May-65 1965 50 Full
## 10 http://… "Wanda… 1214 YES FEMALE <NA> May-65 1965 50 Full
## # … with 95 more rows, 11 more variables: Death1 <chr>, Return1 <chr>,
## # Death2 <chr>, Return2 <chr>, Death3 <chr>, Return3 <chr>, Death4 <chr>,
## # Return4 <chr>, Death5 <chr>, Return5 <chr>, Notes <chr>, and abbreviated
## # variable names ¹`Name/Alias`, ²Appearances, ³`Current?`,
## # ⁴`Probationary Introl`, ⁵`Full/Reserve Avengers Intro`,
## # ⁶`Years since joining`, ⁷Honorary
I have demonstrated five different sampling methods: simple random sampling, stratified random sampling, cluster sampling, systematic sampling, and convenience sampling.
A simple random sample involves selecting a random subset of the observations from the population. In this case, I have randomly selected 50 characters from the “Avengers” dataset. Because every character has the same chance of selection, the sample is unbiased on average, but with only 50 of 173 rows a single draw can still miss the rare characters with very high appearance counts.
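Stratified random sampling involves dividing the population into subgroups (strata) and selecting a random sample from each subgroup. In this case, I have stratified the dataset by the “Gender” variable and selected 10 characters from each subgroup. This can be a useful sampling method if there are important subgroups in the population that need to be represented in the sample. Because the dataset has 115 male and 58 female characters, taking 10 from each group over-represents female characters relative to the population, so pooled estimates from this sample would need to be reweighted.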
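An alternative to a fixed number per stratum is proportional allocation, where the same fraction is drawn from every stratum so the sample keeps the population's gender mix. The sketch below is a minimal base R illustration under stated assumptions: `avengers_demo` is a hypothetical stand-in built from the gender counts reported in this analysis with simulated Appearances values, and `sampling_fraction` and `stratified_prop` are names I introduce here.

```r
set.seed(100)

# Hypothetical stand-in with the gender counts reported in this analysis
# (58 FEMALE, 115 MALE); the Appearances values are simulated, not real.
avengers_demo <- data.frame(
  Gender = rep(c("FEMALE", "MALE"), times = c(58, 115)),
  Appearances = round(rlnorm(173, meanlog = 5, sdlog = 1.2))
)

sampling_fraction <- 0.2  # assumed fraction to draw from every stratum

# Split by stratum, sample the same fraction within each, then recombine.
strata <- split(avengers_demo, avengers_demo$Gender)
stratified_prop <- do.call(rbind, lapply(strata, function(s) {
  k <- round(nrow(s) * sampling_fraction)
  s[sample(nrow(s), k), , drop = FALSE]
}))

print(table(stratified_prop$Gender))
```

With these counts the draw contains 12 female and 23 male rows, which is close to the 1:2 mix of the full dataset.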
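Cluster sampling involves dividing the population into clusters and selecting a random sample of clusters to include in the study. This can be a useful sampling method if the population is geographically or otherwise clustered. In this case, I simply took the first 10 characters in the dataset, which is not a true cluster sample: no clusters were defined and none were chosen at random. The first 10 rows are the earliest members (joining between 1963 and 1965), and their appearance counts, from 612 to 3,458, are far above what is typical for the dataset.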
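A sketch of what a cluster sample could look like on data of this shape is given below. It assumes, purely for illustration, that each decade of joining is one cluster; the dataset itself defines no clusters. `avengers_demo` is a simulated stand-in, and `Decade`, `chosen`, and `cluster_sample` are hypothetical names.

```r
set.seed(100)

# Hypothetical stand-in: simulated joining years and appearance counts.
avengers_demo <- data.frame(
  Year = sample(1963:2015, 173, replace = TRUE),
  Appearances = round(rlnorm(173, meanlog = 5, sdlog = 1.2))
)

# Assumption: treat each decade of joining as one cluster.
avengers_demo$Decade <- floor(avengers_demo$Year / 10) * 10

# Randomly choose whole clusters, then keep every row in the chosen clusters.
decades <- unique(avengers_demo$Decade)
chosen  <- sample(decades, size = 2)
cluster_sample <- avengers_demo[avengers_demo$Decade %in% chosen, ]

print(table(cluster_sample$Decade))
```

The key difference from taking the first 10 rows is that the clusters are selected at random, so no era is included by construction.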
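Systematic sampling involves selecting every nth observation from the population. In this case, I have selected every 10th row starting from row 1 and stopping at row 121, which gives 13 characters and leaves the last 52 rows of the 173-row dataset unsampled. Systematic sampling works well when the row order is unrelated to the variable of interest, or when the order (here, roughly by joining date) spreads the sample across the range of the data. It can mislead if the order contains a periodic pattern that matches the sampling interval.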
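The row indices for a systematic sample can be generated rather than typed by hand, which also makes it easy to cover the whole dataset and to use a random starting row. A minimal sketch, assuming an interval of 10 and the 173 rows reported for the dataset; `n_rows`, `k`, `start`, and `idx` are illustrative names.

```r
set.seed(100)

n_rows <- 173   # number of rows reported for the dataset
k      <- 10    # sampling interval

# Pick a random starting row in 1..k, then take every k-th row after it.
start <- sample(seq_len(k), 1)
idx   <- seq(from = start, to = n_rows, by = k)

print(idx)
# With the data loaded, the sample would be avengers[idx, ].
```

Depending on the starting row, this yields 17 or 18 rows spread over the full length of the dataset.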
Convenience sampling involves selecting the most readily available observations. In this case, I have selected all characters with 100 or more comic book appearances (105 of 173). Strictly speaking, this is a filter on the variable of interest rather than a convenience sample, and it is not representative: it excludes every low-appearance character, so any estimate of typical appearances based on it will be biased upward.
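The effect of filtering on the variable of interest can be seen by comparing sample means with the full-data mean. The sketch below uses a simulated right-skewed vector as a hypothetical stand-in for the Appearances column, so the printed numbers are not results for the real dataset; only the direction of the bias carries over.

```r
set.seed(100)

# Hypothetical stand-in for the Appearances column (simulated, right-skewed).
appearances <- round(rlnorm(173, meanlog = 5, sdlog = 1.2))

full_mean     <- mean(appearances)
srs_mean      <- mean(sample(appearances, 50))          # simple random sample
filtered_mean <- mean(appearances[appearances >= 100])  # appearance filter

print(round(c(full = full_mean, srs = srs_mean, filtered = filtered_mean), 1))
```

The filtered mean is always above the full mean whenever at least one value falls below the cutoff, whereas the simple random sample mean varies around the full mean from draw to draw.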
In conclusion, the choice of sampling method depends on the research question and the characteristics of the population. While some sampling methods can be useful for certain types of populations or research questions, others can introduce bias or inadequately capture the variation in the population. It is important to carefully consider the sampling method and its potential limitations before drawing conclusions based on a sample.
## Use Data wrangling techniques for the appropriate analysis of your data.
avengers_filtered <- avengers %>% filter(!is.na(Gender))
# summarize number of Avengers by gender
avengers_summary <- avengers_filtered %>% group_by(Gender) %>% summarise(count = n())
# view the summary data
avengers_summary
## # A tibble: 2 × 2
## Gender count
## <chr> <int>
## 1 FEMALE 58
## 2 MALE 115
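The counts in the summary table above can be turned into proportions and a male-to-female ratio with a few lines of base R. This is a small worked example using only the two counts shown; `counts`, `proportions`, and `ratio` are illustrative names.

```r
# Gender counts taken from the summary table above.
counts <- c(FEMALE = 58, MALE = 115)

# Share of each gender among the 173 characters.
proportions <- counts / sum(counts)
print(round(proportions, 3))   # FEMALE 0.335, MALE 0.665

# Male-to-female ratio.
ratio <- unname(counts["MALE"] / counts["FEMALE"])
print(round(ratio, 2))         # 1.98
```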
# explore the data set
glimpse(avengers)
## Rows: 173
## Columns: 21
## $ URL <chr> "http://marvel.wikia.com/Henry_Pym_(Eart…
## $ `Name/Alias` <chr> "Henry Jonathan \"Hank\" Pym", "Janet va…
## $ Appearances <dbl> 1269, 1165, 3068, 2089, 2402, 612, 3458,…
## $ `Current?` <chr> "YES", "YES", "YES", "YES", "YES", "YES"…
## $ Gender <chr> "MALE", "FEMALE", "MALE", "MALE", "MALE"…
## $ `Probationary Introl` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `Full/Reserve Avengers Intro` <chr> "Sep-63", "Sep-63", "Sep-63", "Sep-63", …
## $ Year <dbl> 1963, 1963, 1963, 1963, 1963, 1963, 1964…
## $ `Years since joining` <dbl> 52, 52, 52, 52, 52, 52, 51, 50, 50, 50, …
## $ Honorary <chr> "Full", "Full", "Full", "Full", "Full", …
## $ Death1 <chr> "YES", "YES", "YES", "YES", "YES", "NO",…
## $ Return1 <chr> "NO", "YES", "YES", "YES", "YES", NA, "Y…
## $ Death2 <chr> NA, NA, NA, NA, "YES", NA, NA, "YES", NA…
## $ Return2 <chr> NA, NA, NA, NA, "NO", NA, NA, "YES", NA,…
## $ Death3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Return3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Death4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Return4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Death5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Return5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Notes <chr> "Merged with Ultron in Rage of Ultron Vo…
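The glimpse output shows many NA values, especially in the later Death and Return columns, so a useful wrangling step is to count missing values per column before analysis. The sketch below uses a tiny hypothetical data frame (`demo`) that mirrors the first four values of three columns from the glimpse output; it is not the real dataset.

```r
# Tiny hypothetical stand-in mirroring a few columns from the glimpse output.
demo <- data.frame(
  Death1  = c("YES", "YES", "YES", "NO"),
  Return1 = c("NO", "YES", "YES", NA),
  Death2  = c(NA, NA, "YES", NA),
  stringsAsFactors = FALSE
)

# Count missing values in each column.
na_counts <- colSums(is.na(demo))
print(na_counts)
```

On the full data the same idea is `colSums(is.na(avengers))`.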
# create an interactive scatter plot of total appearances by joining year
avengers %>%
  group_by(Year) %>%
  # sum the Appearances column; n() would count characters, not appearances
  summarise(appearances = sum(Appearances, na.rm = TRUE)) %>%
  plot_ly(x = ~Year, y = ~appearances, color = ~appearances,
          type = "scatter", mode = "markers", marker = list(size = 8)) %>%
  layout(xaxis = list(title = "Year"), yaxis = list(title = "Number of Appearances"))
In this project, I explored different statistical techniques for analyzing the “Avengers” dataset. I started by examining the distribution of a numerical variable, which allowed me to understand the central tendency and spread of the data. I then demonstrated the applicability of the Central Limit Theorem by drawing repeated random samples from the dataset and showing that the means of these samples are approximately normally distributed around the population mean, even though the population itself is heavily right-skewed.
Next, I investigated various sampling methods that can be used on the “Avengers” dataset, including simple random sampling, stratified random sampling, cluster sampling, systematic sampling, and convenience sampling. I found that different sampling methods can have varying degrees of representativeness and bias, depending on the structure and characteristics of the dataset. Therefore, it is important to carefully consider the sampling method used in any analysis to ensure the validity and generalizability of the results.
In conclusion, this project demonstrated some of the key statistical concepts and techniques that can be used to analyze the “Avengers” dataset, and that can be applied to other datasets as well. With an appropriate sampling method and sample size, conclusions drawn from a sample can approximate those drawn from the full dataset; with a poorly chosen one, they can be badly biased.